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Abstract 

Lasso and other regularization procedures are attractive methods for variable 
selection, subject to a proper choice of shrinkage parameter. Given a set of 
potential subsets produced by a regularization algorithm, a consistent model 
selection criterion is proposed to select the best one among this preselected 
set. The approach leads to a fast and efficient procedure for variable selec- 
tion, especially in high-dimensional settings. Model selection consistency of 
the suggested criterion is proven when the number of covariates d is fixed. 
Simulation studies suggest that the criterion still enjoys model selection con- 
sistency when d is much larger than the sample size. The simulations also 
show that our approach for variable selection works surprisingly well in com- 
parison with existing competitors. The method is also applied to a real data 
set. 
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1 Introduction 



Variable selection is probably the most fundamental and important topic in linear 
regression analysis. We consider the case where a large number (even larger than the 
sample size) of candidate covariates are introduced at the initial stage of modeling. 
One then has to select a smaller subset of the covariates to fit /interpret the data. 
If the number of potential covariates is not so large (as sm all as 30), one may use 
subset selection to select significant variables (Miller, 1990l ). However, with a large 
number of covaria tes, searching on model space is computationally infeasible. Lasso 
(Ti bshiranil . Il996l ) an d other regu l arizat ion procedures (e.g., the adaptive lasso of 
IZoul (120061 ). SCAD of iFan and Lil (j200ll )) are successful methods to overcome this 
problem. A Lasso-type procedure estimates the regression coefficient vector j3 by 
minimizing the sum of the squared error and a regularization term 

||y-X/3|| 2 + AT(/3), (1) 

where X is an (n x d) non-random design matrix, y is an n— vector of responses, 
and A > is a shrinkage parameter that controls the amount of regularization. The 
regularization function T(f3) can take different forms according to different regu- 
larization procedures. The original and most popular one used in Lasso is the l\ 
norm T(/3) = X}j=il/?jl- As ^ increases, the coefficients are continuously shrunk 
towards 0. When A is sufficiently large, some coefficients are shrunk to exact 0, 
thus leading to sparse solutions. This feature makes the Lasso-type procedures very 
attracti ve for variable selec t ion. Indeed, their model selection consistency has bee n 



shown (IZhao and Yu . 
Under some conditions 
{A„} under which 



20061 ; iMeinshausen and Buhlmannl . 120061 ; IFan and Lil . 1200 ll ) : 



there exists a "proper" sequence of shrinkage parameters 



{j '■ Pj n 7^ 0} = St w.p.l when sample size n is large enough, 



(2) 



where f3 Xn = ,...,/3^") T is the regularized estimator of (3 with shrinkage parameter 
A n , and St is the true model, i.e., St is the index set of true covariates. Therefore, 
it is convenient to use the Lasso-type procedures for variable selection purposes. 

The remaining problem in practice is how to choose such proper A n . 
A widely-used crit e rion is the generalize d cross-validation criterion (GCV) 



19791 : iTibshiranil . Il996f ) 



([Craven and Wahba 

GCV for choosing A for the purpose 
vestigate d yet. Furtherm ore, for choosing shrinkage parameter for the SCAD 



However, theoretical properties of 
of variable selection have not been in- 



method (IFan and Lil . 120011 ). a regularization method closely related to Lasso, GCV 
seems to be l ikely to choose sh r inkage parameters that produce overfitted models 
(I Wang et all 120071 ) . IZou et all (120071 ) showed that the number of nonzero coeffi- 
cients is an unbiased estimate for the degrees of freedom of the lasso. As a result, 
popular model selection criteria - like AIC, BIC and C p - can be used for selecting A. 
However, theoretical properties of the selected model remain unknown. The main 
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contribution of this paper is to propose a criterion for selecting shrinkage parameters 
in order for regularization procedures to produce the true model. (Throughout this 
paper, the true model is assumed to exist. We note, however, that whether or not 
the true model exists is still a controversial issue in the model selection literature, 



sec 



Burnham and Anderson! (120021 )). 



Although regularization procedures can be used for simultaneous variable selec- 
tion and estimation, it seems to be impossible to tune the shrinkage parameter to 
achieve both model selecti on consistency an d optimal estimation at the same time. 
For an orthogonal design, iLeng et al.l (120061 ) showed that the Lasso estimator that 
is optimal in terms of estimation does not give consistent model selection. This fact 
was also shown by iPoetscher and Leebl (120091 ) for other regularized estimators. We 
are in this paper primarily concerned with the problem of variable selection, i.e., we 
use a Lasso-type procedure to produce a set of potential subsets and then select the 
best one among this preselected set using a model selection criterion. It was brought 
to our attention by a reviewer t hat th e idea of usin g Lass o as a "selector" was also 
briefly mentioned in iFriedmanl (120081 ): lEfron et al.l ( 120041 ) . They discussed an ap- 
proach to reduce estimation bias on the non-zero estimated coefficients in which 
the Lasso (with some method for choosing the shrinkage parameter) is used as a 
subset selector and then a different unpenalized procedure is used to estimate the 
coefficients w.r.t. the selected covariates. Our approach is to select the best subset 
of covariates - using a model selection criterion as the stopping rule - among a prese- 
lected set of potential subsets produced by a Lasso-type procedure. The preselected 
set consists of at most d subsets rather than 2 d possible subsets if using subset selec- 
tion. After selecting the best subset, we of course can use an unpenalized procedure 
to estimate the coefficients in order to reduce estimation bias. 

The model selection criterion we use is derived from the loss rank principle 
(LoRP ) , a general-purpose principle for model selection, introduced recently by 



Hutterl (120071 ); iHutter and Tranl ( 120101 ). LoRP selects a model that has the smallest 
loss rank. The loss rank of a model is defined as the number of other "fictitious" 
data that fit the model better t han the training data ( see Section 2 for a formal 
introduction). It was shown by IHutter and Tranl (120101 ) that minimizing the loss 
rank is a suitable criterion for model selection, since it trades off between the qual- 
ity of fit and the model complexity. LoRP s eems t o be a promising principle with 
enormous potential, leading to a rich field. iTranl ( 120091 ) demonstrated the use o f 
LoRP for selecting the ridge parameter in ridge regression. iTran and Hutterl (120101 ) 
adapted the idea of LoRP for model selection in a classification context and showed 
its close conn ection with excellent model selection tec hniques based on Rademacher 
complexities ( iKoltchinskiil . l2001t iBartlett et al.l . 120021 ). In this paper, we shall show 
that LoRP also successfully applies to selecting the shrinkage parameter for the 
purpose of variable selection. 

The main contribution of this paper is to propose a criterion, called the loss 
rank (LR) criterion, for selecting shrinkage parameters for variable selection pur- 
poses. As long as the regularization procedure in use has the consistency property 
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(jSJ), the shrinkage parameter selected by the LR criterion will produce the true model 
asymptotically with probability 1. This model selection consistency of the proposed 
criterion will be proven theoretically in the case where the number of covariates d is 
fixed and smaller than n. For cases with d^$>n, our simulation study suggests that 
this property still holds. The simulation also shows that our method for variable 
selection works surprisingly well. Benefiting from fast Zx-regularization algorithms, 
our method is able to correctly identify significant variables from thousands of can- 
didates in several CPU seconds. 

The paper is organized as follows. The main idea of LoRP is briefly reviewed in 
Section |2j The LR criterion is derived and its model selection consistency is proven 
in Section [31 Simulation studies and real-data application are presented in Section 
HJ Section [5] contains the conclusions and outlook. The proofs are relegated to the 
appendix. 



2 The loss rank principle 

In this sectio n, we give a b ri ef review of the loss ran k principle (LoRP). The reader 
is referred to Hutter ( 2007 ); Hutter and Tranl ( 2010 ) for the details. 



Let us consider a training data set D— (x,y) = {(xi,yi),...,(x n ,y n )} G {X xy) n 
from a regression model 

We first consider discrete y. Suppose that we use a model M to fit the data 
D, e.g., M is a linear regression model with d covariates, or M is a fc-nearest 
neighbors regression model. Imagine that in experiment situations we can conduct 
the experiment many times with fixed design points x. We then would get many 
other (fictitious) output y' . Observe that if the model M is complex/flexible (large 
d, small k), then M fits the training data (x,y) well and it also fits (x,y') well (with 
respect to some loss function). Here, for simplicity, we only consider the squared 
loss LossM(y\x) = \\y — y\\ 2 = J2i(yi~yi) 2 where y is the fitted vector under model 
M. Therefore the loss rank of M defined by 

Rank M (£>) := G y n : Loss M (2/» < Loss M (y\x)} 



will be large for complex M. Conversely, as argued by iHutter and Tranl ( 120101 ). if 
M is small/rigid, that both Loss M(y\x) and Lossm(z/'|^) are large also leads to a 
large loss rank. Thus, it is natural to choose a model with the smallest loss rank as 
a good model. By doing this, we trade off between the fit and the model complexity. 

In the case of continuous y, say for instance y = R, it is natural to use the 
concept of volume instead of the counting measure in the definition of loss rank, i.e., 

Rank M (£>) := Vol{y' G W n : Loss M (y'\x) < LosSAf(y|x)}. 

Consider the case of linear models where the fitted vector y is linear in y, i.e., 
y = M(x)y where the regression matrix M = M(x) depends only on x (using the 
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same symbol M for both model and regression matrix will not cause any confusion 
in the following). Then Loss A / (y\x) = \\y-My \\ 2 = y T Ay with A=(I n -M) T (I n -M) 
and the loss rank is 

Rank M (£>) = Vol{y' G R n : y' T Ay' < L} where L := y T Ay. 

Suppose at the moment that det(A)^0. The set {y' eR n :y' T Ay' <L} is an ellipsoid 
in IR n , so that its volume is 

v n L n/2 
Rank M (D) = " 

VdrtA 

where v n = 7r n / 2 /T(| + l) is the volume of the unit sphere in R n . Because the loga- 
rithm is monotone increasing and v n depends only on n, it is equivalent to consider 

LR M (D) = 2 \og(y T Ay) - \ logdet A. (3) 

Principle 1. Given a class of linear models Ai = {M}, the best model among M. is 
the one with the smallest loss rank 

M best = argmin MeA4 {f \og(y T Ay) - \ logdet A} (4) 

where A=(I n -M) T (I n -M), provided that detv4>0. 

Now we consider the case where detA = (e.g., projective regression) or A is 
nearly singular (e.g., ridge regression when the ridge parameter is very close to 0). 
In such cases, Rank^f(-D) is infinity or extremely large. Following the principle of 
ridge regression, we add, in order to prevent the loss rank from being infinity or 
extremely large, a small penalty a||y|| 2 to the loss 

LossA/(y|a;) := \\y - y\\ 2 + a\\y\\ 2 = y T S a y, S a = A + al n 

where a > is a small number to be determined later. Now, S a being not singular 
yields 

„. Tn/2 

Rank^D) = Vol{y' G R n : y ,T S a y' < L} = where L := y T S a y. 

ydet b a 

Taking logarithm and neglecting a constant independent of M, we define the loss 
rank of model M (dependent on a) as 

LR a M (D) = f \og(y T S a y) - \ logdet S a . (5) 

How do we deal with the extra parameter a? We are seeking a model of sma llest 
loss rank, so it is natural to minimize LR^(D) in a (see lHutter and Tranl (120101 ) for 
a more detailed interpretation). Therefore, we finally define the loss rank of model 

M as 

LR M (£>) = inf LRm(D) = inf {§ \og(y T S a y) - \ logdet^}. (6) 

a>0 ct>0 
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Principle 2. (Hutter and Tran, 2010) Given a class of linear models Ai = {M}, 
the best model among M. is the one with the smallest loss rank 



M 



best 



argmin Mg _ M LR M ( J D) = argmin MeA i inf {§ \og(y T S a y) - \ logdet S a } (7) 



a>0 



where S a = A + aI n = (I n - M) T (/„ -M) + aI n . 

Many attractive properties of LoRP have been pointed out: LoRP reduces to 
Bayesian model se lection in some special cases and has an interpretation in terms 
of MDL principle flHutter and Tranl . l2010h : in the classification context, LoRP has 



a close connection with (and works in some cases better than) mode l selection tech- 
niques based on Rademacher complexities ( Tran and Hutter . 2010l ) . By virtue of 
LoRP, the loss rank criterion for selecting the Lasso parameter will be derived in 
the next section. 



3 The LR criterion 

Let us go back to the v ariable selection problem in linear regression analysis. We 



build on the notation of Hutter and Tran 



Given a (large) set of d potential 



covariates Xi,...,Xd and a response variable Y, we consider the problem of choosing 
the important covariates for explaining Y as a linear function of Xi,...,Xd. It is as- 
sumed as usual that the covariates are linearly independent and that E(Y\X 1 ,...,X d ) 
is a linear combination of Xi,...,Xd with some of the coefficients are zero. We are 
primarily interested in identifying the non-zero coefficients. 

Suppose that the response vector y and the design matrix X have been centered, 
so that the intercept is omitted from models. Denote by St = {j*yjd*} anc ^ $ = 
{jif-jdo} the true model and a candidate model, respectively. Under model S, we 
can write 

y = X(3 S + ae 

where E(e) — 0, cov(e) = / n , o~>0. We shall consider /3 S = (/3<si,---;Asd) T as a point in 
R d with (3 Sj = ifj&S. Denote by B(S) := {0 S = (/3 s ,a 2 ) eR d xR + } the parameter 
space of model S and by X$ the (nxcfo) design matrix obtained from X by removing 
the jth column for all j $lS. 



3.1 The LR criterion 

Let /3 A = (/3^,...,/3^) T be the regularized estimator of (3 w.r.t. a certain shrinkage 
parameter A, i.e., $ x is the solution of dTJ. Denote by S\ = {j : f3j ^ 0} the index 
set corresponding to the non-zero coefficients, by df A = |«Sa| th e number of non-zero 
coefficients, and by X$ x the design matrix corresponding to the selected covariates. 
We assume at the moment that df a < n and further assume that matrices Xs A are 
full rank. The case where df a > n will be dealt with later on. 
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Fitting model S\ by least squares, we denote the OLS estimator and the variance 
estimator by 



|2 
5 A ll ' 



Ps x = {X T sXs x r X Xl x y ; a 2 Sx = ±\\y - X Sx (3 Sx 

respectively. The fitted vector under model S\ 

y Sx = X Sx P Sx = M Sx y with M Sx := X Sx {X T s X Sx )- x Xl x 

is, conditionally on S\, lineai0 in y. Then from fl5]), the loss rank of model S\ with 
parameter a is 

LR" = LR a Sx = f bgtfSZu) - \ logdet(^) 

where S^, = (I — M Sx ) T (I — M Sx ) +al = (l + a)I — M Sx . Because projection matrix 
M Sx has df A eigenvalues 1 and n — df A eigenvalues 0, has df A eigenvalues a and 
n-df A eigenvalues 1 + a. Thus, detS'^ = a dfA (l + a) n ~ dfA . Let p A := j/ 5 J 2 /lll/ll 2 > 
we have 

LR a x = f log y T y + f log(p A + a) - ^ log a - ^ log(l + a) . 

Taking derivative w.r.t a, it is easy to see that LR A is minimized at a m = ^ l _ p p x x ^_ dix 
provided that 1 — p\>df\/n. This condition is ensured by Assumption (A3) below. 
Finally, after some algebra, the loss rank of model S\ as defined in ([6]) can be 
explicitly expressed as 

LR A = LR^ = f log \\y\\ 2 - |KL(^||1 - p A ). (8) 

where KL(p||g) = plog- + (1 — p)logj^ is the Kullback-Leibler divergence between 
the Bernoulli distributions with parameters p,q£ (0,1). The optimal shrinkage pa- 
rameter^) A (for variable selection purposes) chosen by the LR criterion will be 

Alr e argmin A > LR A = argmax A > KL( d ^||l - p A ). (9) 

Often, LR A reaches its minimum in an interval (A;,A U ) (see Figure [1]). Any A in 
this interval produces the same model. This can be explained as follows. When 
A increases from to infinity, th e number of non-z ero coefficients of f3 x will be a 



non- increasing step function of A (jEfron et al.l . 12004 ); in other words, the covariates 



are in turn removed from the models. As a result, by its definition LR A is also a step 
function. Note that our emphasis is on variable selection rather than on coefficient 
estimation. 



1 Strictly speaking, y Sx is not linear in y because S\ depends on y. However, we can consider 
preselected subsets S\ as fixed models. If instead we first derive the LR criterion for a general 
fixed model S and then apply to S\, we get the same results. 
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3.2 Model selection consistency for fixed d 

In order to prove the model selection consistency of the LR criterion, we assume in 
this section that d is fixed and d<n. We need the following assumptions 

(Al) There exists a deterministic sequence of reference shrinkage parameters X n 
such that S\ n — >St w.p.l. 

(A2) e is Gaussian N(0,I n ). 

(A3) For each candidate A, p\ is bounded away from and 1, i.e., there are constants 
ci, C2 such that 0< c\ < p\ <C2 < 1 w.p.l. 

Comments. p\ = \\y — y Sx || 2 /||y || 2 is a measure of fit. In extreme cases where the 
resulting model S\ is too big or too small, p\ will be close to and 1, respectively. 
Therefore, it is reasonable to consider only A in which p\ is bounded away from 
and 1. Note that for every S\ we have that 

\\y-vs x f °k 

Pa " 



\y-y Sx \\ 2 + ll£sj 2 °l x + \\ys x \?l n ' 

For A such that S\ is the true model St, (A3) follows from a mild sufficient condition 
< lim inf (-||y 5 || 2 ) < lim sup (-llj/s || 2 ) < oo and al — > a 2 > w.p.l 

where y s is the fitted vector under the true model. Moreover, if the intercept is 
included in the models, we have that n(y) 2 < ||^ 5a || 2 < \\y\\ 2 - (A3) then follows from 
a very mild condition 

< lim inf (y) 2 < lim sup (^||?/|| 2 ) < oo and a 2 ^ — > constant > V5 w.p.l. 



» n 
n— >oo 



Assumption (Al) is satisfied by s ome regular i zation procedures, for example, Lasso 
(IZhao and Yul . 120061 ) and SCAD (IFan and Lil . l200ll ). Normality assumption (A2) is 
not a necessary condition for consistency. This assumption can be relaxed, but then 
a more complicated proof technique is needed. 
We have the following lemma. 

Lemma 3. The loss rank of model S\ can be rewritten as 

LR A = f \og(na 2 Sx ) + aff(^) + f log ^ (10) 

where H(p) :=— plogp— (1— p)log(l— p) is the entropy of p. Under Assumption (A3), 
the loss rank LK\ has the form 

LR A = |logo- 2 A + ^logn + f logn + P (l), (11) 

where Op(l) denotes a bounded random variable w.p.l. 
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Proof. With p\ = nag x /\\y\\ 2 , rearranging terms in ([5} we get (fTUl) . The fact that 
+ logp — )• 1 as p — 7-0 implies that (note that df A <<i and d is fixed) 

fif(^) = ^ logn + - log df A ) + o(l). 

Under Assumption (A3), the last term of f lTUj) is bounded. This completes the 
proof. ■ 



The above lemma is used to prove model selection consistency of the LR criterion. 

Theorem 4 (Model selection consistency of the LR criterion). Assume that d is 
fixed. Under Assumptions (A1)-(A3), the shrinkage parameter selected by the LR 
criterion will produce the true model w.p.l when n is large enough, i.e., 



P(«% 



S T ) ->■ 1 



where Alr is determined in (jUJ). 

The idea of the proof is to bound the probabilities of picking under- and overfitted 
models. A model S is said to be underfitted if S misses at least one true covariate 
(i.e., S^_St), overfitted if S contains all true covariates and at least one untrue 
(i.e., S^St)- There is a finite number of such S, so it is sufficient to prove that 
P(<S Alr —S)— >0 for each of them. The detailed proof is relegated to the appendix. 

We can of course use other model selection criteria rather than LoRP for choosing 
the best subset among the preselected set produced by the repolarizat ion procedure . 
The most widely-u s ed sel ection criteria in statistics are probably AIC (lAkaikd . Il973l ) 
and BIC ( iSchwarzl . Il978l ). AIC is asymptotically optimal in terms of loss efficiency 
but likely to select overfitted model s, whi l e BIC i s asymptotic ally optimal in terms 
of model selection consistency; see IShaol (119971 ); lYangl ( 120051 ). Therefore one may 
use BIC as another stopping rule besides LoRP. The shrinkage parameter chosen by 
BIC will be 



Abic e argmin A > BIC A where BIC A := § log£l + ^ log 



n. 



(12) 



We see from Lemma [3] that, up to a constant, the LR criterion is asymptotically 
equivalent to BIC. It follows from the proof of Theorem H] that using BIC also leads 
to the same model selection consistency, i.e., P(iS Abjc =<St) — > 1 a s n— >oo. However, 
finite-sample simulation studies in the next section show that the LR criterion works 
better than BIC, especially when d^>n. 



3.3 Large d small n 

High-dimensional variable selection problems in which d 3> n is currently of great 
interest to scientists. In order for such a problem to be solvabl e, an essential as- 
sumption needed is that it is d*— sparse (ICandes and Taol 120071 ) . i.e., the number 
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of true covariates d* must be smaller than n. Under this solvability assumption, it 
is clear that we can safely ignore irrelevant cases in which the number of covariates 
dfA under consideration is larger than n. Then the LR criterion (jHJ) is still valid. 
In practice, therefore, we propose to ignore those A under which dfA > n and apply 
the LR criterion as usual. A theoretically rigorous treatment is beyond the scope 
of the present paper, which we intend to do in a future paper. However, a system- 
atic simulation study in the next section suggests that the LR criterion still works 
surprisingly well and enjoys model selection consistency. 



4 Numerical examples 



In this section, we present simulation studies for the LR criterion, compare the LR 
criterion to other methods, and also apply it to a real data set. The regularization 
procedure we use is Lasso . The Lasso solution paths are computed by the LARS 
algorithm of lEfron et al.l (120041) . A widely-used method for choosing the Lasso 
parameter is GCV (ICraven and Wahbal . Il979l ; iTibshiranil . I19961 ) 



GCV A = - 



i \\y -X(3> 



nil- -T)¥\ 



where DF A := ti[X(X T X + \W-)- 1 X T y] } W = diag(| &|) and W~ is a generalized 
inverse of W. Another one is the BIC-type criterion of IWang et al.l ( 120071 ) (although 
its variable selection consistency requires the oracle property, a property not enjoyed 
by Lasso) 

W +DFA logn 



BICa = log 



\y 



n 



n 



Note that (3 x ^(3 Sx . The former is the Lasso estimator whereas the latter is the OLS 
estimator resulting from fitting model S\ by least squares. Our proposed criteria 
OH]) and ( JT2"j) are constructed based on (3 Sx , not /3 A . This is the essential difference 
between our approach and the others. 

Example 1: small d. We consider the following example which is taken from 



Tibshiranil (119961 ): 



y = x 1 (3 + ere 



where /3 = (3, 1.5, 0, 0, 



2, 0, 0, 0) 1 



%i are marginally iV(0,l) with the correlation 
between Xi and Xj equal to 0.5'*~ J ', e~iV(0,l). We compare the performance of LR 
and BIC criterion to that of GCV and BIG The performance is measured by the 
frequency of underfitting, overfitting and correct fitting and average number of zero 
coefficients over 100 replications. 

^Table [1] summarizes the simulation results for various factors n and a. Although 
BIC works slightly better than GCV, it still produces overfitted models most of the 
time. BIC does a good job and LR outperforms the others. 
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Table 1: The small-c? case 



a n Method Under- Correctly Overfitted(%) Ave. No. 



fitted(%) fitted(%) of zeros 

1 100 GCV 100 L57 - 

BIC 3 97 2.32 

BIC 89 11 4.88 

LR 97 3 4.97 

200 GCV 100 1.64 
BIC 100 1.81 
BIC 94 6 4.93 
LR 100 5 

3 100 GCV 100 1.3T" 
BIC 100 1.53 
BIC 1 70 29 4.22 
LR 1 77 22 4.37 

"200 gcv o o Too Tm~ 

BIC 100 2.09 

BIC 91 9 4.89 

LR 91 9 4.90 



Example 2: large d. We consider cases of large d in this example with <i = 300 
and n = 100, 200, 500. We set up a sparse recovery problem in which most of 
coefficients are zero except f3 30 = (3qo = ... = (3^00 = ^0 . The design matrix is simulated 
as in Example 1. Table [2] summarizes the simulation results for various factors 
n=100, 200, 500 and cr=l, 3. The LR criterion works surprisingly well in comparison 
with BIC and the others. 

Let us take a closer look at the simulation results in Tables [TJ12J Although the 
LR and BIC criteria are asymptotically equivalent to each other, the finite-sample 
simulation study shows that t he LR criterion works better than BIC. A similar 
situation was also observed in iHutter and Tranl (120101 ) for subset selection. This 
is probably because, contrarily to the BIC criterion, the penalty term of the LR 
criterion is data- adaptive. Some results in the model selection literature show that 
selection criteria with data- adapti v e pen alties are more encouraging than those with 
deterministic penalties; see lYangJ (120051 ) and references therein. We see that BIC 
seems to break down for the cases d > n as it always produces overfitted models, 
but starts working well when n>d. The Op(l) term in (fTTl) plays an important 
role here: it serves as a "corrector" to BIC. Not e that BIC is ju st an approximation 
to the logarithm of posterior model probability ( jSchwarzl . Il978l ). the approximation 
might be inaccurate if n is not large enough relative to d. 



Example 3: Prostate cancer data. We consider a real data set in this example. 
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Table 2: The large-c? case 



a n Method Under- Correctly Overfitted(%) Ave. No. 



fitted(%) fitted(%) of zeros 

1 100 GCV 100 90.20 

BIC 100 95.8 

BIC 100 202.01 

LR 30 70 288.24 

~200 GCV 100 87.51 

BIC 100 89.45 

BIC 100 102.02 

LR 86 14 289.83 

~500 GCV 100 97.51 

BIC 100 104.45 

BIC 40 60 287.30 

LR 100 290 

3 100 GCV 100 78.35 s " 

BIC 100 87.40 

BIC 100 202.04 

LR 18 82 287.51 

~200 GCV 100 92.02 

BIC 100 96.51 

BIC 100 102.01 

LR 58 42 289.29 

~500 GCV 100 93.31 

BIC 100 96.52 

BIC 35 65 288.35 

LR 80 20 289.75 



Stamey et all (119891 ) studied the correlation between the level of prostate antigen 



(Ipsa) and a number of clinical measures in men: log cancer volume (Icavot), log 
prostate weight (Iweight), age, log of the amount of benign prostatic hyperplasia 
(Ibph), seminal vesicle invasion (svi), log of capsular penetration (lep), Gl eason score 
(gleason), and percentage of Gleason scores 4 or 5 (pgg45). Following iTibshirani 
(119961 ). we assume a linear regression model between the response Ipsa and the 8 
covariates. We want to select a parsimonious model for the sake of scientific insight 
into the response-covariate relationship. 

The data set of size 97 is standardized so that the intercept /3q is excluded. 
Figure [Q presents the curves GCVa, BICa, LRa (1000 values of A ranging from 0.01 
to 10 in increments of .01 were used to search for the optimal A). The A selected 
by GCV, BIC are .5 and 1.1, and the corresponding models are {1, 2, 3, 4, 5, 7, 8}, 
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Figure 1: Prostate cancer data: LR A , BIC A and GCV A . 



{1, 2, 3, 4, 5, 8}, respectively. The LR criterion is minimized in the interval (3.1,5.9). 
Any value in this interval produces the same model <S LR = {1, 2, 5}. The BIC of 
these models are —19.20, —21.38, —25.19, respectively. That means the BIC also 
supports the choice of the LR criterion. (Note however that this does not mean that 
the BIC is an optimal criterion). 



5 Conclusions and outlook 



Regularization procedures are efficient methods for variable selection, subject to a 
proper choice of shrinkage parameter. By virtue of LoRP, a general-purpose principle 
for model selection, the LR criterion for variable selection in linear regression analysis 
was proposed. Variable selection consistency of the suggested criterion was pointed 
out theoretically and experimentally. Both theoretical and experimental results show 
that the proposed criterion is a very encouraging procedure for variable selection 
problem, especially in high-dimensional settings. Regularization procedures have 
now been extended to genaralized linear models and beyond, we intend to extend 
our approach to such frameworks in future work. 



M. N. Tran, Department of Statistics and Applied Probability, National University 
of Singapore, Singapore 117546. 
E-mail: ngoctm@nus.edu. sg 
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Appendix 



Proof of Theorem^ The main idea of the proof is taken from Chamba3 ( 2006 ). Let 



us denote by Zi = (xn,...,Xid,yi) the z-th observation and by 7(.,m,<r 2 ) the density 
of the Gaussian distribution with mean m and variance a 2 . Under model S, the 
density of z { is Pe s (zi) ='y(y i ,J2 j( - s (3 j x ij ,a 2 ). The log-likelihood is 



n 



Ws) = J2 h SPe s (zi) = -I log(27r) - f log a 2 - & - J^/J 

i=l i=l jeS 



It is easy to see that 



sup U0) = -floga|-f(l + log(27r)). 

6»ee(5) 



By (TTTT) . the loss rank of model S\ now can be written as 



LR A = - sup l n {9) + ^logn + C{n) + P {l) 
e&e(s x ) 



where the constant term C(n) = |logn — |(l + log(27r)) is independent of S\. 

No underestimation. It is sufficient to prove that P(5 Alr =S) — >0 for each S^St, 
as there is only a finite number of such S. 

nsx LB =s) = p(<% lr = <s,lr Alr <lr a j 

= P(i sup l n {9)-l sup l n (9)> l ^(M XhR -df Xn ) + o P (l),S XhK = S 



0ee(s A „) 

sup /, 

eee(s T ) 



< 


P( 


~ sup l n (9) 






v eee(S) 


< 


P( 


i sup Z n (6>) 






v eee(S) 


< 


P( 


i sup Z n (0) 









(13) 



where 8* G <Sx denotes the true parameter. By the law of large numbers for the 
supremum of the likelihood ratios (see, e.g., Lemma Bl of lChambazI (bOQfih ) 



i sup l n (9) - H n (9*) ^ - inf KL(pe4Po) w.p.l. 

' 6»e0(5) 068(5) 

Because S^>St, mi dG Q^KL(p e *\\p e )>0. This, together with the fact that ^ff (l^l- 
ti*)— >-0 and Assumption (Al), shows that the left-hand side term of ([TBI goes to 
as n— >oo. 

No overestimation. Fix an overfitted model S^St, let us denote by 
H(9) := KL(p e * \\ Pe ) = E[Ml n {9*) - l n (9))\ > V0 E 0(5) 



14 



(H(9) is not necessarily positive) and h n (9):= ln ^^"/2 with convention jj = 0. For 
every #G0(S) 

l n (6)-l n {6*)+nH(e) = l n (e)-l n {e*)-E[l n (6)-l n {6*)\ 

= H{df'\h n {e) - Eh n {6)) 

< H(6) 1/2 sup (h n (v)-Eh n (v)). (14) 

uee(s) 

By Q(St) cO(iS) and the property of supremum, for every e>0 there exists # o G0(iS>) 
such that 



and also 



From ([U]) and flU 



sup (Z n (0) - Z n (0*)) < l n (9 ) - l n (6*) + e 
eee(s) 



/n(e ) - an > o. 



(15) 



(16) 



sup (l n (6) - l n (6*)) < HiO,) 1 ' 2 sup {h n {6) - Eh n {6)) + e. (17) 
6>ee(S) 6>ee(S) 



From ffl~6l) and (EE 



nH(e Q ) < l n (6 ) - k 



nH(9 ) < H(8 ) 1/2 sup (h n (6) - Eh n (9)) 

6»eO(5) 



or 



nH{9,) l / 2 < sup {h n {6)-Eh n {9)). 

6»eO(5) 



Now, since e>0 was chosen arbitrarily ( fTTj) and ( |T8i) yield 



supZ n (0)- sup Z n (0)<sup{Z n (0)-Z n (0*)}<± sup (M^)-^nW) • (19) 
0(5) e(5 T ) e(5) Yeee(5) y 

We need the foll owing bounded law of the itera ted logarithm which is a consequ ence 
of Theorem 4.1, iDudley and W.PhilippI ( 119831 ) or Lemma B2, IChambazI ( 120061 ). 



Lemma 5. There is a finite constant C so that 

sup ee0((S) \h n (9)-Eh n { ,, 
hmsup = <C w.p.l. 

Now for every overfitted model S^St, it is sufficient to prove that P(5^ lh 
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S)->0. In fact, 



iS, LR? <C LR,\ 



< P sup Z n (0) - sup Z n (0) > ^p{\S\ - df A J + Op(l) 



e(5) 



0(5a„) 



< P ( sup Z n (0) - sup l n {6) > ^(|«S| - tf) + O p (1 
k e(5) e(s T ) 

-sup@ (5) J n (0) - sup e(5T) Z n (0) 



< P 



log 


;loj 






<i* 
2 


log 


n . 




log 


;loj 






d* 
1 


log 


n ■ 





log log n 
su Pe(5) \K{0) ~ Eh r 



>gl-l + op(l) )+P(S Xn ^S T ) 



y/ n log log n 



> 



l + o P (l) 



P(S Xn ? S T \20) 



where the last inequality follows from ( IT9l) . Observe that |«S| >d* as S^St- This, 
together with Lemma |5] and the fact that loglogn/( ylogn) — > 0, implies that the 
first probability of (!20|) goes to zero. The second probability of ( 1201) also goes to 
zero because of Assumption (Al). This completes the proof. ■ 
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